Skip to content

Conversation

@fresh-borzoni
Copy link
Contributor

@fresh-borzoni fresh-borzoni commented Jan 11, 2026

Adds column projection support to LogScanner via a unified, Pythonic API:

API:

Table.new_log_scanner(project=None, columns=None)

Usage Examples:

  # Scan all columns (default)
  scanner = await table.new_log_scanner()

  # Project by column index
  scanner = await table.new_log_scanner(project=[0, 2, 4])

  # Project by column name (more Pythonic!)
  scanner = await table.new_log_scanner(columns=['id', 'name', 'email'])

Features:

  • Single unified method with keyword arguments (more Pythonic than separate methods)
  • Mutually exclusive parameters: specify only project or columns, not both
  • Empty lists treated as "all columns" for convenience
  • Early validation with clear error messages
  • Automatic deduplication preserving order

Closes #149

Copy link
Contributor

@leekeiabstraction leekeiabstraction left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! Left a couple of comments. PTAL

Comment on lines 123 to 125
let rust_scanner = table_scan.create_log_scanner().map_err(|e| {
FlussError::new_err(format!("Failed to create log scanner: {e}"))
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to use FlussError and PyErr within this file, for example line 72 to 75 uses PyErr. Can you clarify when each should be used?

            let rust_scanner = table_scan.create_log_scanner().map_err(|e| {
                PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!(
                    "Failed to create log scanner: {e:?}"
                ))
            })?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use FlussError, this is a leftover.

Thank you for catching this 👍

Comment on lines 184 to 198
# Project specific columns by index (C++ parity)
print("\n1. Projection by index [0, 1] (id, name):")
scanner_index = await table.new_log_scanner_with_projection([0, 1])
scanner_index.subscribe(None, None)
df_projected = scanner_index.to_pandas()
print(df_projected.head())
print(f" Projected {df_projected.shape[1]} columns: {list(df_projected.columns)}")

# Project specific columns by name (Python-specific, more idiomatic!)
print("\n2. Projection by name ['name', 'score'] (Pythonic):")
scanner_names = await table.new_log_scanner_with_column_names(["name", "score"])
scanner_names.subscribe(None, None)
df_named = scanner_names.to_pandas()
print(df_named.head())
print(f" Projected {df_named.shape[1]} columns: {list(df_named.columns)}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the polling part also be included (as with C++ example)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get what you mean.

to_pandas() polls internally, but if we're talking about adding a separate polling API to Python bindings - we can add it.

Let's file an issue for it, as it's orthogonal to the column projection feature.

Copy link
Contributor Author

@fresh-borzoni fresh-borzoni Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue #152

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't realise that to panda polls. Thank you for the clarification

@fresh-borzoni
Copy link
Contributor Author

@leekeiabstraction Thank you for the review.

Addressed the comments. PTAL 🙏

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds column projection capabilities to Python's LogScanner, enabling users to scan only specific columns by index or name, thereby improving performance by reducing data transfer and processing overhead.

Changes:

  • Added new_log_scanner_with_projection() method for index-based column projection (C++ API parity)
  • Added new_log_scanner_with_column_names() method for name-based column projection (Python-idiomatic approach)
  • Added example usage demonstrating both projection methods

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
bindings/python/src/table.rs Implements two new methods for creating log scanners with column projection support, leveraging core project() and project_by_name() APIs
bindings/python/example/example.py Adds demonstration code showing both index-based and name-based column projection usage
Comments suppressed due to low confidence (1)

bindings/python/src/table.rs:188

  • There is significant code duplication between new_log_scanner, new_log_scanner_with_projection, and new_log_scanner_with_column_names. The three methods share nearly identical boilerplate code for creating the FlussTable, getting the admin, and constructing the LogScanner. Consider extracting a private helper function that accepts an optional projection configuration to reduce duplication and improve maintainability.
        future_into_py(py, async move {
            let fluss_table =
                fcore::client::FlussTable::new(&conn, metadata.clone(), table_info.clone());

            // Convert Vec<String> to Vec<&str> for the API
            // Safe: project_by_name validates names immediately, doesn't store refs
            let column_name_refs: Vec<&str> = column_names.iter().map(|s| s.as_str()).collect();

            let table_scan = fluss_table.new_scan();
            let table_scan = table_scan
                .project_by_name(&column_name_refs)
                .map_err(|e| FlussError::new_err(format!("Failed to project columns: {e}")))?;

            let rust_scanner = table_scan.create_log_scanner().map_err(|e| {
                FlussError::new_err(format!("Failed to create log scanner: {e}"))
            })?;

            let admin = conn
                .get_admin()
                .await
                .map_err(|e| FlussError::new_err(e.to_string()))?;

            let py_scanner = LogScanner::from_core(rust_scanner, admin, table_info.clone());
            Python::attach(|py| Py::new(py, py_scanner))
        })

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@fresh-borzoni fresh-borzoni force-pushed the python-bindings-projections branch from 4f29135 to 0ff57cd Compare January 15, 2026 15:53
@fresh-borzoni
Copy link
Contributor Author

Changed to unified interface, added silent deduplication and more idiomatic python API with keywords
Updated PR description

@luoyuxia @leekeiabstraction PTAL 🙏

Copy link
Contributor

@luoyuxia luoyuxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fresh-borzoni Thanks for the pr. Only one comment. PTAL

@fresh-borzoni fresh-borzoni force-pushed the python-bindings-projections branch from 580a018 to f587309 Compare January 15, 2026 16:23
@fresh-borzoni
Copy link
Contributor Author

@luoyuxia Thanks for the review!
Fixed, PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python bindings missing scanner projection support

3 participants